MOSRA: Joint Mean Opinion Score and Room Acoustics Speech Quality Assessment
The acoustic environment can degrade speech quality during communication
(e.g., video call, remote presentation, outside voice recording), and its
impact is often unknown. Objective metrics for speech quality have proven
challenging to develop given the multi-dimensionality of factors that affect
speech quality and the difficulty of collecting labeled data. Hypothesizing that
room acoustics meaningfully affect speech quality, this paper presents MOSRA: a
non-intrusive, multi-dimensional speech quality metric that predicts room
acoustics parameters (SNR, STI, T60, DRR, and C50) alongside the overall mean
opinion score (MOS) for speech quality. By explicitly optimizing the model to
learn these room acoustics parameters, we can extract more informative features
and improve the generalization for the MOS task when the training data is
limited. Furthermore, we also show that this joint training method enhances the
blind estimation of room acoustics, improving the performance of current
state-of-the-art models. An additional side-effect of this joint prediction is
the improvement in the explainability of the predictions, which is a valuable
feature for many applications.
Comment: Submitted to Interspeech 202
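To make the joint-prediction idea concrete, here is a minimal PyTorch sketch (an assumed architecture, not the authors' model): a shared encoder feeds one regression head per room acoustics parameter plus a MOS head, and the per-target losses are summed so the auxiliary acoustics targets shape the shared features.

```python
# Minimal sketch of MOSRA-style joint prediction (assumed architecture).
import torch
import torch.nn as nn

class MOSRASketch(nn.Module):
    def __init__(self, n_mels: int = 64, hidden: int = 128):
        super().__init__()
        # Shared feature extractor over log-mel frames (hypothetical choice).
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True)
        # One regression head per target: MOS plus the five acoustics params.
        self.heads = nn.ModuleDict({
            name: nn.Linear(hidden, 1)
            for name in ["mos", "snr", "sti", "t60", "drr", "c50"]
        })

    def forward(self, mel: torch.Tensor) -> dict:
        _, h = self.encoder(mel)   # final hidden state: (1, batch, hidden)
        h = h.squeeze(0)
        return {name: head(h).squeeze(-1) for name, head in self.heads.items()}

model = MOSRASketch()
mel = torch.randn(4, 200, 64)      # (batch, frames, mels), dummy input
preds = model(mel)
targets = {k: torch.randn(4) for k in preds}   # dummy labels
# Joint loss: the MOS term plus the auxiliary room-acoustics terms.
loss = sum(nn.functional.mse_loss(preds[k], targets[k]) for k in preds)
loss.backward()
```

Because every head backpropagates through the same encoder, the acoustics labels act as extra supervision for the shared features, which is the mechanism the abstract credits for better MOS generalization with limited data.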
Speaker Embeddings as Individuality Proxy for Voice Stress Detection
Since the mental state of the speaker modulates speech, stress induced by
cognitive or physical load can be detected in the voice. The existing voice
stress detection benchmark has shown that audio embeddings extracted from
the Hybrid BYOL-S self-supervised model perform well. However, the benchmark
evaluates performance on each dataset separately and does not assess
generalization across different types of stress and different languages.
Moreover, previous studies found strong individual differences in stress
susceptibility. This paper presents the design and development of a voice
stress detection model trained on more than 100 speakers from nine language
groups and five different types of stress. We address individual variability
in voice stress analysis by adding speaker embeddings to the Hybrid BYOL-S features. The
proposed method significantly improves voice stress detection performance with
an input audio length of only 3-5 seconds.
Comment: 5 pages, 2 figures. Accepted at Interspeech 202
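The fusion step itself is simple; the following PyTorch sketch shows one plausible form of it (the feature dimensions and classifier are assumptions, and the Hybrid BYOL-S and speaker-embedding extractors are stubbed out as random tensors): the speaker embedding is concatenated with the utterance features before a small stress classifier.

```python
# Minimal sketch of the individuality-proxy fusion (assumed shapes/classifier).
import torch
import torch.nn as nn

BYOLS_DIM = 2048   # assumed Hybrid BYOL-S feature size
SPK_DIM = 256      # assumed speaker-embedding size (e.g. x-vector-like)

classifier = nn.Sequential(
    nn.Linear(BYOLS_DIM + SPK_DIM, 256),
    nn.ReLU(),
    nn.Linear(256, 1),   # binary stressed / not-stressed logit
)

def detect_stress(byols_feat: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
    """Fuse features by concatenation and return the stress logit."""
    fused = torch.cat([byols_feat, spk_emb], dim=-1)
    return classifier(fused)

# Stand-ins for real extractor outputs on a 3-5 s clip.
logit = detect_stress(torch.randn(1, BYOLS_DIM), torch.randn(1, SPK_DIM))
prob = torch.sigmoid(logit)   # probability the clip is stressed
```

Concatenation lets the classifier condition its decision on who is speaking, which is how the speaker embedding serves as a proxy for individual differences in stress susceptibility.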
BYOL-S: Learning Self-supervised Speech Representations by Bootstrapping
Methods for extracting audio and speech features have been studied since
pioneering work on spectrum analysis decades ago. Recent efforts are guided by
the ambition to develop general-purpose audio representations. For example,
deep neural networks can extract optimal embeddings if they are trained on
large audio datasets. This work extends existing methods based on
self-supervised learning by bootstrapping, proposes various encoder
architectures, and explores the effects of using different pre-training
datasets. Lastly, we present a novel training framework that produces a
hybrid audio representation combining handcrafted and data-driven learned
audio features. All the proposed representations were evaluated within the HEAR
NeurIPS 2021 challenge for auditory scene classification and timestamp
detection tasks. Our results indicate that the hybrid model with a
convolutional transformer as the encoder yields superior performance in most
HEAR challenge tasks.
Comment: Submitted to HEAR-PMLR 202
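A minimal PyTorch sketch of the hybrid idea follows (the components are assumptions, not the paper's recipe): pooled log-mel statistics stand in for the handcrafted features, an untrained convolutional stack stands in for the bootstrapped self-supervised encoder, and the two are concatenated into a single embedding.

```python
# Minimal sketch of a hybrid (handcrafted + learned) audio representation.
import torch
import torch.nn as nn
import torchaudio

class HybridEmbedder(nn.Module):
    def __init__(self, n_mels: int = 64, learned_dim: int = 512):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(
            sample_rate=16000, n_mels=n_mels)
        # Stand-in for the self-supervised encoder; in BYOL-S this part
        # would be pre-trained with BYOL-style bootstrapping.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_mels, learned_dim, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(1),
        )

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        mel = torch.log(self.melspec(wav) + 1e-6)   # (batch, mels, frames)
        handcrafted = mel.mean(dim=-1)              # pooled log-mel statistics
        learned = self.encoder(mel).squeeze(-1)     # data-driven features
        return torch.cat([handcrafted, learned], dim=-1)

emb = HybridEmbedder()(torch.randn(2, 16000))       # 1 s of 16 kHz audio
print(emb.shape)                                    # torch.Size([2, 576])
```

The concatenation preserves the robustness of fixed spectral features while letting the learned half capture whatever the pre-training data rewards, which is the trade-off the hybrid representation is designed around.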